Task: Analyze public data about smart device use habits, and make recommendations for improving marketing strategies for the company.
The data is available through Kaggle as the FitBit Fitness Tracker Data set. It consists of 18 CSV files of various sizes. The largest one is 85.4 MB, which is too large to handle comfortably in a spreadsheet but is perfectly fine for R. I decided to use RStudio on my laptop for the data analysis.
Load some helpful library packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
library(RColorBrewer)
There are 18 csv files with data in them.
Reading all files into R can be done in various ways, but not all 18 are needed at once. For this reason, instead of running the following chunk, which reads all 18 files at once, we will read a few specific ones now and the other files later, whenever they are needed.
Loading all csv files in the directory using their file name as data frame name:
#files <- list.files(pattern = "\\.csv$", full.names = TRUE)
#data_list <- map(files, read_csv)
#file_names <- tools::file_path_sans_ext(basename(files))
#names(data_list) <- file_names
Read the file contents into data frames and “glimpse” at them.
activity_daily <- read_csv("./dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calories_daily <- read_csv("dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
intensities_daily <- read_csv("dailyIntensities_merged.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steps_daily <- read_csv("dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_daily <- read_csv("./sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(activity_daily)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(calories_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
glimpse(intensities_daily)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(steps_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…
glimpse(sleep_daily)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
The dates are formatted as strings. We change them to date format. We also make sure they are all called ActivityDate for uniformity.
calories_daily <- rename(calories_daily, ActivityDate = ActivityDay)
intensities_daily <- rename(intensities_daily, ActivityDate = ActivityDay)
steps_daily <- rename(steps_daily, ActivityDate = ActivityDay)
sleep_daily <- rename(sleep_daily, ActivityDate = SleepDay)
activity_daily$ActivityDate <- mdy(activity_daily$ActivityDate)
calories_daily$ActivityDate <- mdy(calories_daily$ActivityDate)
intensities_daily$ActivityDate <- mdy(intensities_daily$ActivityDate)
steps_daily$ActivityDate <- mdy(steps_daily$ActivityDate)
sleep_daily$ActivityDate <- date(mdy_hms(sleep_daily$ActivityDate))
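As a small check of the parsing step, the sketch below applies the same lubridate functions to two hand-written strings in the formats seen in the glimpses above. The sample strings and the variable names are made up for illustration; they are not read from the files.

```r
library(lubridate)

# Hand-written sample strings in the two formats that appear in the CSV files
day_only  <- "4/12/2016"
day_clock <- "4/12/2016 12:00:00 AM"

# mdy() parses month/day/year strings into Date objects
d1 <- mdy(day_only)

# mdy_hms() parses date-time strings; date() then keeps only the calendar date
d2 <- date(mdy_hms(day_clock))

class(d1)   # the parsed value is a Date, no longer a character string
d1 == d2    # both strings refer to the same calendar day
```

Because both formats end up as plain dates, the sleep data can later be matched to the other data frames on the date column.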
Let us look at the data frames again:
glimpse(activity_daily)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(calories_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 177…
glimpse(intensities_daily)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(steps_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 105…
glimpse(sleep_daily)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
We know that the Ids represent users. Let’s check how many different users there are.
activity_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
calories_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
intensities_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
steps_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
sleep_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 24 1503960366 8792009665
33 distinct users in four of the data frames, 24 distinct users in the sleep data frame. We will have to be careful handling the sleep data together with the other activity data, because the users will not match.
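One way to see exactly which users are missing from the sleep data is sketched below. It uses two tiny made-up tables (activity_toy and sleep_toy, with invented Ids) in place of the real data frames, so the numbers are illustrative only.

```r
library(dplyr)

# Toy stand-ins for activity_daily and sleep_daily (made-up Ids and values)
activity_toy <- tibble(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
sleep_toy    <- tibble(Id = c(1, 3), TotalMinutesAsleep = c(400, 420))

# Users that appear in the activity data but never in the sleep data
missing_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
missing_sleep

# Activity rows belonging to users without any sleep record
dropped <- anti_join(activity_toy, sleep_toy, by = "Id")
dropped
```

Running the same two calls on the real data frames would list the users who have no sleep records at all, which is useful to know before combining the tables.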
Three different intensity levels are recorded for activities: LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes. Let us add these together to get a new column called TotalActiveMinutes, and also compare how the calories burnt relate to the minutes spent at each intensity level.
ggplot(data=activity_daily, aes(x=LightlyActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Lightly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=FairlyActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Fairly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=VeryActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
activity_daily <- activity_daily %>%
mutate(TotalActiveMinutes = LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes)
ggplot(data=activity_daily, aes(x=TotalActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Total Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We have to be careful with the comparison because the scales on the axes are not the same. While the total active minutes show good correlation to the calories burnt, notice that burning 3,000 calories on average would require almost 500 minutes of total activities, which is more than 8 hours. Compared to that, the figure with the Very Active Minutes shows 3,000 calories burnt on average with only about 80 minutes of activities. This emphasizes the importance of intensive activities. In order to investigate this more, we will redraw the Calories vs. Total Active Minutes figure with color gradient added using the proportion of very active minutes to the total active minutes (VAMinProp = VeryActiveMinutes/TotalActiveMinutes).
activity_daily <- activity_daily %>%
mutate(VAMinProp = VeryActiveMinutes/TotalActiveMinutes)
ggplot(data=activity_daily, aes(x=TotalActiveMinutes, y=Calories)) +
geom_point(aes(colour = VAMinProp)) + scale_colour_gradient2() + geom_smooth() + labs(title="Calories vs. Total Active Minutes with Proportion of Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This figure also suggests that short but higher-intensity activities have a more significant calorie-burning effect. This is very important information for people with sedentary work and a sedentary lifestyle, who might have limited time for exercise.
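One detail worth guarding against when computing a proportion such as VAMinProp: if a day has zero total active minutes, the division 0/0 yields NaN in R. The sketch below, on made-up numbers, shows the plain division next to a guarded version. Whether such days actually occur in the data set is an assumption that would need checking.

```r
library(dplyr)

# Made-up rows; the last one has no recorded activity at all
toy <- tibble(
  VeryActiveMinutes  = c(30, 0, 0),
  TotalActiveMinutes = c(120, 200, 0)
)

toy <- toy %>%
  mutate(
    # Plain division: 0 / 0 evaluates to NaN
    VAMinProp_raw = VeryActiveMinutes / TotalActiveMinutes,
    # Guarded version: days without active minutes are marked as missing (NA)
    VAMinProp = if_else(TotalActiveMinutes > 0,
                        VeryActiveMinutes / TotalActiveMinutes,
                        NA_real_)
  )

toy
```

ggplot2 treats both NaN and NA as missing colour values, so the guard mainly makes the intent explicit and keeps later summaries, such as mean(..., na.rm = TRUE), predictable.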
Next, we look at the calories and the steps, which are in separate data frames. Let us merge them by the Ids and the dates, and then plot the calories vs. the number of steps taken.
calories_steps <- merge(calories_daily, steps_daily, by=c('Id', 'ActivityDate'))
glimpse(calories_steps)
## Rows: 940
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 177…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 105…
ggplot(data=calories_steps, aes(x=StepTotal, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories Burnt vs. Total Steps")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As before, and not surprisingly, the figure shows good correlation between the calories burnt and the total number of steps. The issue is, again, that it takes about 20,000 steps to burn 3,000 calories on average. This is quite a lot of steps from my personal experience. On the other hand, there is a large spread in the calories burnt, especially in the range of 10,000-15,000 steps. Let us go back to the intensities and see how the distances covered at the various intensity levels correlate with the calories burnt.
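The spread within a step range can be put into numbers by binning the step counts and computing the standard deviation of the calories per bin. The sketch below does this on a few made-up rows standing in for calories_steps; the 5,000-step bin width is an arbitrary choice made for illustration.

```r
library(dplyr)

# Made-up daily records standing in for calories_steps
toy <- tibble(
  StepTotal = c(2000, 4000, 11000, 12000, 14000, 16000),
  Calories  = c(1500, 1700, 1800, 2600, 3400, 3000)
)

spread_by_bin <- toy %>%
  # Bin the step counts into ranges [0, 5000), [5000, 10000), and so on
  mutate(StepBin = cut(StepTotal, breaks = seq(0, 20000, by = 5000),
                       right = FALSE, dig.lab = 5)) %>%
  group_by(StepBin) %>%
  # Number of days, mean calories, and spread of calories in each bin
  summarise(n = n(), mean_cal = mean(Calories), sd_cal = sd(Calories))

spread_by_bin
```

A bin with a single observation gets an NA standard deviation, so on real data it is worth reading sd_cal together with n.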
ggplot(data=activity_daily, aes(x=LightActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Lightly Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=ModeratelyActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Moderately Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=VeryActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Very Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=TotalDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Total Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activity_daily$LightActiveDistance, activity_daily$Calories)
## [1] 0.4669168
cor(activity_daily$ModeratelyActiveDistance, activity_daily$Calories)
## [1] 0.2167899
cor(activity_daily$VeryActiveDistance, activity_daily$Calories)
## [1] 0.4919586
cor(activity_daily$TotalDistance, activity_daily$Calories)
## [1] 0.6449619
Only the total distance shows a reasonably strong correlation with the calories (0.645); the correlations with the individual intensity distances are weaker.
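The four separate cor() calls can also be collapsed into one, because cor() accepts a data frame of columns as its first argument and a single vector as its second. The sketch below uses made-up numbers in a small data frame named toy; only the column names are borrowed from activity_daily.

```r
# Made-up numbers standing in for a few columns of activity_daily
toy <- data.frame(
  LightActiveDistance = c(1, 2, 3, 4, 5),
  VeryActiveDistance  = c(0, 0, 1, 3, 2),
  TotalDistance       = c(1, 2, 4, 7, 7),
  Calories            = c(1500, 1700, 2100, 2900, 2800)
)

# One correlation coefficient per distance column, returned as a one-column matrix
dist_cols <- c("LightActiveDistance", "VeryActiveDistance", "TotalDistance")
cors <- cor(toy[, dist_cols], toy$Calories)
round(cors, 3)
```

The row names of the result are the column names, which makes the coefficients easy to compare side by side.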
Let us see if it makes any difference if we take into account the proportion of the distance covered during very active periods with respect to the total distance (VADistProp).
activity_daily <- activity_daily %>%
mutate(VADistProp = VeryActiveDistance/TotalDistance)
ggplot(data=activity_daily, aes(x=TotalDistance, y=Calories)) +
geom_point(aes(colour = VADistProp)) + scale_colour_gradient2() + geom_smooth() + labs(title="Calories vs. Total Distance with Proportion of Very Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This figure does not show that a short distance at high intensity is better than a long distance at low intensity. The reason is probably that high-intensity exercise sometimes covers only a small distance. In other words, someone who covered a total distance of 5 miles might have covered most of it during moderate or light activity (hence the white color), but can still burn a lot of calories with high-intensity stationary activities. The figure does show, however, that the longer distances were mostly covered during very active periods.
We compare the TotalMinutesAsleep to the TotalTimeInBed. How is sleep time related to bed time?
ggplot(data=sleep_daily, aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep Time vs. Bed Time")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We can see a nice linear relationship. The bulk of the data shows that people's sleep time is no more than about an hour shorter than the time they spend in bed. This is good information. It suggests that if we want to sleep more, we simply have to spend more time in bed. It is somewhat concerning that there are several data points showing people who sleep less than 4 hours or more than 10 hours, or who stay in bed more than 14 hours. Let us remove (filter out) those data points as outliers.
sleep_daily %>% filter(TotalMinutesAsleep > 240 & TotalMinutesAsleep < 600 & TotalTimeInBed <840) %>%
ggplot(aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep Time vs. Bed Time with outliers removed")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
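When filtering outliers it helps to record how many observations were dropped. A minimal sketch on made-up rows, using the same three thresholds as above (the names toy_sleep, kept, and n_removed are illustrative):

```r
library(dplyr)

# Made-up sleep records standing in for sleep_daily
toy_sleep <- tibble(
  TotalMinutesAsleep = c(200, 400, 450, 650),
  TotalTimeInBed     = c(230, 430, 500, 700)
)

# Keep records with 4-10 hours of sleep and less than 14 hours in bed
kept <- toy_sleep %>%
  filter(TotalMinutesAsleep > 240, TotalMinutesAsleep < 600, TotalTimeInBed < 840)

# Number of records treated as outliers
n_removed <- nrow(toy_sleep) - nrow(kept)
n_removed
```

Separating the conditions with commas inside filter() is equivalent to joining them with &.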
We merge the activity_daily and sleep_daily data frames using an inner join on Id and ActivityDate. Remember that there are 33 distinct people in the activity data frame but only 24 in the sleep data frame.
activities_sleep <- merge(activity_daily, sleep_daily, by=c('Id', 'ActivityDate'))
glimpse(activities_sleep)
## Rows: 413
## Columns: 21
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ TotalSteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ TotalDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ FairlyActiveMinutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ LightlyActiveMinutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ SedentaryMinutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ Calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ TotalActiveMinutes <dbl> 366, 257, 272, 267, 222, 345, 245, 238, 324, …
## $ VAMinProp <dbl> 0.06830601, 0.08171206, 0.10661765, 0.1348314…
## $ VADistProp <dbl> 0.2211765, 0.2252511, 0.3407643, 0.3321079, 0…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
activities_sleep %>% summarise(n_distinct(Id), min(Id), max(Id))
## n_distinct(Id) min(Id) max(Id)
## 1 24 1503960366 8792009665
Let us see whether burning more calories goes together with more sleep by plotting Calories against TotalMinutesAsleep.
ggplot(data=activities_sleep, aes(x=TotalMinutesAsleep, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Sleep vs. Activities")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep$Calories,activities_sleep$TotalMinutesAsleep)
## [1] -0.02852571
Apparently, sleeping and the amount of calories burnt are not correlated.
Let us look at the connection between sedentary minutes and total minutes asleep.
ggplot(data=activities_sleep, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep vs. SedentaryMinutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep$SedentaryMinutes,activities_sleep$TotalMinutesAsleep)
## [1] -0.599394
It looks like there is a negative correlation, but only for more than 8 hours (480 minutes) of sedentary time. This is interesting data that could be investigated more, and an app could take it into account with possible warnings that neither very low nor very high sedentary time helps with sleep.
Let us look at filtered data that has more than 8 hours (480 minutes) of sedentary time.
activities_sleep_filtered <- activities_sleep %>% filter(SedentaryMinutes > 480)
ggplot(activities_sleep_filtered, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() +
labs(title="Sleep vs. SedentaryMinutes for SedentaryMinutes > 480")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep_filtered$SedentaryMinutes, activities_sleep_filtered$TotalMinutesAsleep)
## [1] -0.6809041
Data with more than 8 hours (480 minutes) of sedentary time shows a stronger negative correlation between sedentary time and sleep time.
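To compare both sides of the 480-minute threshold in one step, the correlation can be computed per group. The sketch below uses made-up values in place of activities_sleep, and the Side labels are introduced only for illustration.

```r
library(dplyr)

# Made-up records standing in for activities_sleep
toy <- tibble(
  SedentaryMinutes   = c(300, 400, 450, 600, 700, 800, 900),
  TotalMinutesAsleep = c(420, 430, 425, 480, 430, 380, 300)
)

by_side <- toy %>%
  # Label each record by which side of the 480-minute threshold it falls on
  mutate(Side = if_else(SedentaryMinutes > 480, "above 480", "480 or below")) %>%
  group_by(Side) %>%
  # Group size and Pearson correlation within each group
  summarise(n = n(), r = cor(SedentaryMinutes, TotalMinutesAsleep))

by_side
```

Reporting n next to r matters here, because a correlation computed on only a handful of records on one side of the threshold is not very informative.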
Let us load and look at the weightLogInfo_merged file:
weightLog <- read_csv("./weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(weightLog)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
weightLog %>% select(BMI) %>% summary()
## BMI
## Min. :21.45
## 1st Qu.:23.96
## Median :24.39
## Mean :25.19
## 3rd Qu.:25.56
## Max. :47.54
n_distinct(weightLog$Id)
## [1] 8
Note that the mean BMI is 25.19, which indicates slightly overweight persons. In any case, 67 observations coming from only 8 persons make a very small data set. This data is not enough to draw conclusions from.
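Because the 67 records come from only 8 persons, the mean over records can be dominated by whoever logged their weight most often. The sketch below, on made-up numbers, contrasts the record-level mean with the mean of per-person means; toy_weight and per_person are illustrative names.

```r
library(dplyr)

# Made-up weight log: person 1 logged four times, person 2 only once
toy_weight <- tibble(
  Id  = c(1, 1, 1, 1, 2),
  BMI = c(22, 22, 22, 22, 32)
)

# Mean over all records: weighted towards the frequent logger
record_mean <- mean(toy_weight$BMI)

# Mean BMI per person first, then averaged over persons
per_person <- toy_weight %>%
  group_by(Id) %>%
  summarise(mean_bmi = mean(BMI), n_logs = n())
person_mean <- mean(per_person$mean_bmi)

c(record_mean = record_mean, person_mean = person_mean)
```

In this toy example the two means are 24 and 27, which shows how far apart they can be when the number of logs per person is uneven.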
Let us load and look at the hourlyIntensities_merged file:
hourlyIntensities <- read_csv("./hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hourlyIntensities)
## Rows: 22,099
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/1…
## $ TotalIntensity <dbl> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
The column ActivityHour contains both the date and the time together as a character string. We have to separate them in order to examine the times of the day when people are more active. We create a column called ActivityDay for the date and store only the hour of the day in the already existing column ActivityHour.
hourlyIntensities <- hourlyIntensities %>%
mutate(ActivityDay = date(mdy_hms(ActivityHour)),
ActivityHour = hour(mdy_hms(ActivityHour)))
glimpse(hourlyIntensities)
## Rows: 22,099
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ TotalIntensity <dbl> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
## $ ActivityDay <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, 2016…
Create a data frame TothourlyIntensities that has intensities grouped by the hour of the day.
TothourlyIntensities <- hourlyIntensities %>%
group_by(ActivityHour) %>%
summarise(total_intensity = sum(TotalIntensity))
glimpse(TothourlyIntensities)
## Rows: 24
## Columns: 2
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ total_intensity <dbl> 1989, 1324, 974, 414, 590, 4614, 7235, 9993, 13656, 14…
Let us see what time of the day people are the most active. Since the totals per hour are already computed, we use a bar chart for this purpose.
ggplot(data=TothourlyIntensities, aes(x=ActivityHour, y=total_intensity)) + geom_col() +
labs(title="Total Intensity vs. Time of Day")
People are most active between 5 and 7 p.m. and also around noon. An app could remind people to start their exercise activity around these times.
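The busiest hours can also be read off numerically instead of from the chart. The sketch below, on made-up hourly records, computes the total next to the mean per hour (the mean is not affected by hours that happen to have more records) and picks the peak hour with slice_max(). The toy values and the names toy_hourly, by_hour, and peak are illustrative.

```r
library(dplyr)

# Made-up hourly records standing in for hourlyIntensities
toy_hourly <- tibble(
  ActivityHour   = c(12, 12, 18, 18, 18, 3),
  TotalIntensity = c(20, 30, 25, 35, 30, 2)
)

by_hour <- toy_hourly %>%
  group_by(ActivityHour) %>%
  summarise(total_intensity = sum(TotalIntensity),
            mean_intensity  = mean(TotalIntensity),
            n_records       = n())

# Hour of the day with the highest mean intensity
peak <- by_hour %>% slice_max(mean_intensity, n = 1)
peak
```

On the real data, raising n in slice_max() would return the top few hours, which could feed directly into the timing of app reminders.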
Just to confirm that the most intensive activity periods correspond to the most calories burnt, let us see when people burn the most calories using the hourlyCalories_merged.csv file.
hourlyCalories <- read_csv("./hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hourlyCalories)
## Rows: 22,099
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
As before, we separate the date and the time of the day from the ActivityHour column, then create a data frame TothourlyCalories that has the calories summed by the hour of the day, and use a bar chart to see at what time of the day people burn the most calories.
hourlyCalories <- hourlyCalories %>%
mutate(ActivityDay = date(mdy_hms(ActivityHour)),
ActivityHour = hour(mdy_hms(ActivityHour)))
glimpse(hourlyCalories)
## Rows: 22,099
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Calories <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
## $ ActivityDay <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-…
TothourlyCalories <- hourlyCalories %>%
group_by(ActivityHour) %>%
summarise(total_calories = sum(Calories))
glimpse(TothourlyCalories)
## Rows: 24
## Columns: 2
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ total_calories <dbl> 67066, 65464, 64551, 63013, 63620, 76152, 80994, 87959,…
ggplot(data=TothourlyCalories, aes(x=ActivityHour, y=total_calories)) + geom_col() +
labs(title="Total Calories Burnt vs. Time of Day")
The peaks in the calories chart match the most active periods (5-7 p.m. and around noon).
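The visual match between the two charts can be backed by a number by joining the two hourly summaries on the hour and computing their correlation. The sketch below uses made-up totals for four hours; toy_intensity and toy_calories stand in for TothourlyIntensities and TothourlyCalories.

```r
library(dplyr)

# Made-up hourly totals standing in for the two summary data frames
toy_intensity <- tibble(ActivityHour = 0:3, total_intensity = c(5, 2, 40, 60))
toy_calories  <- tibble(ActivityHour = 0:3, total_calories  = c(70, 65, 110, 140))

# One row per hour with both totals side by side
hourly <- inner_join(toy_intensity, toy_calories, by = "ActivityHour")

# Pearson correlation between hourly intensity and hourly calories
r <- cor(hourly$total_intensity, hourly$total_calories)
r
```

A value close to 1 on the real summaries would confirm that the hours with the highest intensity are also the hours with the most calories burnt.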
The recommendations for the company are to include in the app the following features: